Optimal outlier removal in high-dimensional spaces
Abstract
We study the problem of finding an outlier-free subset of a set of points (or a probability distribution) in n-dimensional Euclidean space. As in [BFKV 99], a point x is defined to be a β-outlier if there exists some direction w in which its squared distance from the mean along w is greater than β times the average squared distance from the mean along w. Our main theorem is that for any ε > 0, there exists a (1 − ε) fraction of the original distribution that has no O((n/ε)(b + log(n/ε)))-outliers, improving on the previous bound of O(nb/ε). This is asymptotically the best possible, as shown by a matching lower bound. The theorem is constructive, and results in a 1/(1 − ε) approximation to the following optimization problem: given a distribution μ (i.e. the ability to sample from it), and a parameter ε > 0, find the minimum β for which there exists a subset of probability at least (1 − ε) with no β-outliers.
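The β-outlier definition above can be checked numerically for a finite point set. The following is a minimal sketch, not the paper's algorithm: it only evaluates the definition. It assumes the standard fact that the largest ratio of squared deviation along w to average squared deviation along w, taken over all directions w, equals the quadratic form d^T M^+ d, where d is the point minus the mean and M is the second-moment matrix about the mean. The function names `outlier_scores` and `beta_outliers` are hypothetical, chosen for illustration.

```python
import numpy as np

def outlier_scores(points):
    """For each point x, return the maximum over directions w of
    (w . (x - mean))^2 / average_y (w . (y - mean))^2.

    Assumption: the points are treated as a uniform distribution over the sample.
    """
    X = np.asarray(points, dtype=float)
    centered = X - X.mean(axis=0)
    # Second-moment matrix about the mean (plain average, no n-1 correction).
    M = centered.T @ centered / len(X)
    # The pseudo-inverse covers directions with zero variance; sample points
    # never deviate from the mean along such directions.
    M_pinv = np.linalg.pinv(M)
    # Quadratic form d^T M^+ d, evaluated row by row.
    return np.einsum("ij,jk,ik->i", centered, M_pinv, centered)

def beta_outliers(points, beta):
    """Boolean mask marking the points that are beta-outliers."""
    return outlier_scores(points) > beta

# Four symmetric points plus one far-away point (illustrative data only).
sample = [[1, 0], [-1, 0], [0, 1], [0, -1], [8, 0]]
scores = outlier_scores(sample)
print(scores)
print(beta_outliers(sample, 3.0))
```

In this sketch the scores of a full-rank sample average to the dimension n, so some point always has score at least n. Removing a point changes the mean and the second moments, so the scores have to be recomputed after every removal.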
Similar resources
RNN (Reverse Nearest Neighbour) in Unproven Reserve Based Outlier Discovery
Outlier detection refers to the task of identifying patterns that do not conform to established regular behavior. Outlier detection in high-dimensional data presents various challenges resulting from the “curse of dimensionality”. The current view is that distance concentration, that is, the tendency of distances in high-dimensional data to become indiscernible, makes distance-based methods label all points a...
Feature Extraction for Outlier Detection in High-Dimensional Spaces
This work addresses the problem of feature extraction for boosting the performance of outlier detectors in high-dimensional spaces. Recent years have observed the prominence of multidimensional data on which traditional detection techniques usually fail to work as expected due to the curse of dimensionality. This paper introduces an efficient feature extraction method which brings nontrivial im...
Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data
Background and purpose: As science, knowledge, and technology evolve, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimensional problems, classical methods are not reliable because of a large number of predictor variable...
A Robust Method for Detecting DB-Outliers from High Dimensional Datasets
Outlier detection is a popular technique that can be utilized in many modern applications like financial analysis and fraud detection. As data descriptions become more complex, the dimensionality of the datasets involved keeps increasing. However, current research finds that it is extremely difficult to pick out outliers directly from high-dimensional datasets owing to the curse of dimensionality. ...
Journal:
- J. Comput. Syst. Sci.
Volume 68
Publication date: 2004